Week 4 Exercises

yufan_yin_week4: 6.10. - 12.10.2020

Also see in the page to my course diary: https://yufanyin.github.io/datavis-R/

Exercise 1

1.1 Reading the data

Read the region_scores.csv data

region_scores <- read.csv(file = "D:/Users/yinyf/datavis-R/week4/region_scores.csv", stringsAsFactors = TRUE)
region_scores <- region_scores %>%
  mutate(id = as.character(id),
         region = factor(region),
         education = factor(education, ordered = TRUE),
         gender = factor(gender))

glimpse(region_scores)
## Rows: 240
## Columns: 6
## $ id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", ...
## $ region    <fct> South Karelia, Satakunta, Kymenlaakso, South Karelia, Sou...
## $ education <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ gender    <fct> M, F, M, F, F, F, M, F, M, M, F, M, F, M, M, M, F, M, F, ...
## $ age       <int> 56, 41, 48, 41, 35, 60, 28, 28, 48, 51, 45, 55, 41, 24, 6...
## $ score     <dbl> 4.268811, 5.646586, 6.949019, 7.096777, 6.990985, 5.26766...

Cutting values (score) into intervals

to groups of width 10

region_scores %>%
  mutate(score_group = cut_width(score, 10, boundary = 0)) %>% 
  count(score_group)
##   score_group   n
## 1      [0,10]  55
## 2     (10,20] 154
## 3     (20,30]  31
region_scores <- region_scores %>%
  mutate(score_group = cut_width(score, 10, boundary = 0, 
                                 labels = c('-10','11-20','21-'))) 
region_scores %>% 
  distinct(score_group)
##   score_group
## 1         -10
## 2       11-20
## 3         21-

Column score_group is not found.

region_scores2 <- region_scores %>%
  group_by(education, score_group, .drop = FALSE) %>%
  summarise(mean_age = mean(age),
            sd_age = sd(age),
            n = n()) %>%
  ungroup()
## `summarise()` regrouping output by 'education' (override with `.groups` argument)
region_scores2
## # A tibble: 9 x 5
##   education score_group mean_age sd_age     n
##   <fct>     <fct>          <dbl>  <dbl> <int>
## 1 1         -10             39.5  10.1     46
## 2 1         11-20           38.8  10.2     39
## 3 1         21-            NaN    NA        0
## 4 2         -10             45     9.27     9
## 5 2         11-20           42.2   9.61    65
## 6 2         21-             39.3   7.57     3
## 7 3         -10            NaN    NA        0
## 8 3         11-20           40.1  10.4     50
## 9 3         21-             37.4   8.97    28

1.2 Histograms

Create a figure that shows the distributions (density plots or histograms) of age and score in separate subplots (facets). What do you need to do first?

Note: I’m not sure the group varible to create subplots.

In the figure, set individual x-axis limits for age and score by modifying the scales parameter within facet_wrap().

Question: What went wrong when I used facet_wrap() but saw the warning ‘Layer 1 is missing score_group(or other group variable)’ ? I met last week, too. I saved score_group.

region_scores %>%
  ggplot(aes(age, fill = score_group)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

(Try more as a reminder in future)

region_scores %>%
  ggplot(aes(age, fill = gender)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

region_scores %>%
  ggplot(aes(score, fill = gender)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

1.3 Density plots

Note: I do not understand the meaning of y-axis in such density plots.

region_scores %>%
  ggplot(aes(age, fill = gender)) + 
  geom_density(alpha = .5) 

region_scores %>%
  ggplot(aes(score, fill = gender)) + 
  geom_density(alpha = .5) 

Exercise 2

In this exercise, you will use the built-in iris dataset.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

2.1 Make data into long format

Make the data into long format: gather all variables except species into new variables var (variable names) and measure (numerical values). You should end up with 600 rows and 3 columns (Species, var, and measure). Assign the result into iris_long.

iris_long <- iris %>%
  gather(var, measure, -Species) 
str(iris_long)
## 'data.frame':    600 obs. of  3 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var    : chr  "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
##  $ measure: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

2.2 Spread: long-to-wide

In iris_long, separate var into two variables: part (Sepal/Petal values) and dim (Length/Width).

Then, spread the measurement values to new columns that get their names from dim. You must create row numbers by dim group before doing this.

You should now have 300 rows of variables Species, part, Length and Width (and row numbers). Assign the result into iris_wide.

Note: It was a bit complex than the example. I tried many times but failed. So I kept some of the codes in the following chunk.

iris_long %>%
  group_by(Species) %>%
  mutate(row = row_number()) %>%
  ungroup %>%
  spread(?, ?) %>%
  select(-row)

However,

Must extract column with a single valid subscript. x Subscript `var` has the wrong type `data.frame<Sepal.Width:double>`. i It must be numeric or character.

Or:

iris_long %>%
  pivot_wider(names_from = c(var),
  values_from = measure) 
## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>      <list>       <list>      <list>       <list>     
## 1 setosa     <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]> 
## 2 versicolor <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]> 
## 3 virginica  <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]>

There is still error.

2.3 Scatter plot

Using iris_wide, plot a scatter plot of length on the x-axis and width on the y-axis. Colour the points by part.

iris_wide %>%
  ggplot(aes(Length, Width), color = Species) + # x = length, y = width
  geom_point()

Exercise 3

3.1 Reading my own data

Import your data into R. Check that you have the correct number of rows and columns, column names are in place, the encoding of characters looks OK, etc.

learning2019_w4 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week4.csv", stringsAsFactors = TRUE)

3.2

Print the structure/glimpse/summary of the data. Outline briefly what kind of variables you have and if there are any missing or abnormal values. Make sure that each variable has the right class (numeric/character/factor etc).

learning_w4 <- learning2019_w4 %>%
  mutate(studylength = as.numeric(studylength),
         writingcourse = as.numeric(writingcourse))
str(learning_w4)
## 'data.frame':    206 obs. of  10 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 2.67 1 2.33 3 3.67 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ studylength    : num  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : num  2 3 4 0 0 11 0 0 44 35 ...

Exercise 4 Counting observations by grouping variables

Pick a few (2-5) variables of interest from your data (ideally, both categorical and numerical).

For categorical variables, count the observations in each category (or combination of categories). Are the frequencies balanced?

learning19_w4 %>%
  count(cluster, gender) %>%
  arrange(desc(n)) %>%
  arrange(cluster)

Error: Must group by variables found in .data. * Column cluster is not found. Neither is learning19_w4[1]. Well… I’m not very angry.

For numerical variables, compute some summary statistics (e.g. min, max, mean, median, SD) over the whole dataset or for subgroups. What can you say about the distributions of these variables, or possible group-wise differences?

Overall:

summary(learning_w4)
##    锘縞luster        unref            deep            orga      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.670   1st Qu.:3.750   1st Qu.:2.670  
##  Median :2.000   Median :2.000   Median :4.000   Median :3.330  
##  Mean   :1.718   Mean   :2.178   Mean   :4.007   Mean   :3.411  
##  3rd Qu.:2.000   3rd Qu.:2.670   3rd Qu.:4.500   3rd Qu.:4.000  
##  Max.   :3.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      blocks      procrastination     gender      studentstatus  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:1.000   1st Qu.:2.000  
##  Median :2.670   Median :3.250   Median :2.000   Median :2.000  
##  Mean   :2.655   Mean   :3.212   Mean   :1.714   Mean   :1.767  
##  3rd Qu.:3.330   3rd Qu.:3.750   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :2.000   Max.   :2.000  
##   studylength     writingcourse   
##  Min.   :  2.00   Min.   : 0.000  
##  1st Qu.:  5.00   1st Qu.: 0.000  
##  Median : 14.00   Median : 3.000  
##  Mean   : 19.75   Mean   : 6.694  
##  3rd Qu.: 28.00   3rd Qu.: 6.000  
##  Max.   :172.00   Max.   :91.000

For subgroups:

**Note:" I do not believe the mean values of subgroups divided by gender or student status(Bechelor/Master) could be equal. What’s wrong?

grouped_df <- learning_w4 %>%
  group_by(studentstatus)

grouped_df %>%
  summarise(unref_mean = mean(learning_w4$unref), deep_mean = mean(learning_w4$deep), orga_mean = mean(learning_w4$deep))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   studentstatus unref_mean deep_mean orga_mean
##           <int>      <dbl>     <dbl>     <dbl>
## 1             1       2.18      4.01      4.01
## 2             2       2.18      4.01      4.01

We can see studylength (how many month students have been studied in the university) is a better grouping value than (numbers) of writingcourse. But …

Try cluster (student profile based on the combination of scores on ‘unref’, ‘deep’ and ‘orga’)

learning_w4 %>%
  count(learning_w4[1])
##   锘縞luster  n
## 1          1 94
## 2          2 76
## 3          3 36
grouped_learning <- learning_w4 %>%
  group_by(learning_w4[1])

grouped_learning %>%
  summarise(unref_mean = mean(grouped_learning$unref), deep_mean = mean(grouped_learning$deep), orga_mean = mean(grouped_learning$orga))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
##   锘縞luster unref_mean deep_mean orga_mean
##        <int>      <dbl>     <dbl>     <dbl>
## 1          1       2.18      4.01      3.41
## 2          2       2.18      4.01      3.41
## 3          3       2.18      4.01      3.41
# the results look strange but I do not know what went wrong

Exercise 5

Describe if there’s anything else you think should be done as “pre-processing” steps (e.g. recoding/grouping values, renaming variables, removing variables or mutating new ones, reshaping the data to long format, merging data frames together).

Do you have an idea of what kind of relationships in your data you would like to visualise and for which variables? For example, would you like to depict variable distributions, the structure of multilevel data, summary statistics (e.g. means), or include model fits or predictions?

5.1 Reading the data

Structure of the data

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_w4.csv", stringsAsFactors = TRUE) 
learning19 <- learning2019[1:13]
str(learning19)
## 'data.frame':    211 obs. of  13 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ studylength    : int  39 51 3 3 15 3 3 3 3 3 ...

The aim of the study is to investigate the interrelationships between the approaches to learning and conceptions of academic writing among international university students. Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups based on their Z scores on the three approaches to learning. Then we compare mean differences and ANOVA results between the profiles.

The data ‘learning2019’ consists of 218 observations and 17 variables. It contains their scores of approaches to learning (different ways that students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, eg:gender, age, faculty, student status and study length).

The explanation of some columns are as follows. Each of them was average value of 2-4 questions in 5-point Likert scale (1= totally disagree, 5 = fully agree).

  • “unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.

  • “deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.

  • “orga”: time management, study organisation, effort management and concentration.

  • “blocks”: the inability to write productively whose reason is not intellectual capacity or literary skills.

  • “procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.

  • “perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.

  • “innateability”: writing is a skill which “is determined at birth” or “cannot be taught or developed”.

  • “ktransforming”: (knowledge transforming) using writing for developing knowledge and generating new ideas and in the reflective and dialectic processes.

  • “productivity”: (sense of productivity) part of self-efficacy in writing.

5.2 Exploring the data numerically and graphically

5.2.1 Summaries of the variables

summary(learning19)
##    锘縞luster        unref            deep            orga      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.667   1st Qu.:3.750   1st Qu.:2.667  
##  Median :2.000   Median :2.000   Median :4.000   Median :3.333  
##  Mean   :1.716   Mean   :2.171   Mean   :4.007   Mean   :3.414  
##  3rd Qu.:2.000   3rd Qu.:2.667   3rd Qu.:4.500   3rd Qu.:4.000  
##  Max.   :3.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      blocks      procrastination perfectionism   innateability  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.667   Median :3.250   Median :2.333   Median :1.500  
##  Mean   :2.662   Mean   :3.219   Mean   :2.556   Mean   :1.761  
##  3rd Qu.:3.333   3rd Qu.:3.875   3rd Qu.:3.333   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  ktransforming    productivity       gender      studentstatus  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.667   1st Qu.:1.875   1st Qu.:1.000   1st Qu.:3.000  
##  Median :4.000   Median :2.500   Median :2.000   Median :4.000  
##  Mean   :4.041   Mean   :2.487   Mean   :1.716   Mean   :3.185  
##  3rd Qu.:4.667   3rd Qu.:3.250   3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :4.750   Max.   :2.000   Max.   :4.000  
##   studylength    
##  Min.   :  2.00  
##  1st Qu.:  5.00  
##  Median : 14.00  
##  Mean   : 21.63  
##  3rd Qu.: 28.50  
##  Max.   :172.00

5.2.2 Relationships between the variables

Calculate and print the correlation matrix

cor_matrix<-cor(learning19[2:10]) %>% round(digits = 2)
cor_matrix
##                 unref  deep  orga blocks procrastination perfectionism
## unref            1.00 -0.48 -0.31   0.33            0.25          0.28
## deep            -0.48  1.00  0.32  -0.27           -0.18         -0.19
## orga            -0.31  0.32  1.00  -0.22           -0.38         -0.14
## blocks           0.33 -0.27 -0.22   1.00            0.55          0.54
## procrastination  0.25 -0.18 -0.38   0.55            1.00          0.35
## perfectionism    0.28 -0.19 -0.14   0.54            0.35          1.00
## innateability    0.16 -0.11 -0.02   0.24            0.13          0.28
## ktransforming   -0.16  0.31  0.16  -0.30           -0.21         -0.25
## productivity    -0.15  0.16  0.30  -0.38           -0.46         -0.22
##                 innateability ktransforming productivity
## unref                    0.16         -0.16        -0.15
## deep                    -0.11          0.31         0.16
## orga                    -0.02          0.16         0.30
## blocks                   0.24         -0.30        -0.38
## procrastination          0.13         -0.21        -0.46
## perfectionism            0.28         -0.25        -0.22
## innateability            1.00         -0.25         0.01
## ktransforming           -0.25          1.00         0.21
## productivity             0.01          0.21         1.00

Specialized according to the significant level and visualize the correlation matrix p.mat <- cor.mtest(cor_matrix)$p

library(corrplot)
## corrplot 0.84 loaded
p.mat <- cor.mtest(cor_matrix)$p
corrplot(cor_matrix, method="circle", type="upper",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of learning19", mar=c(0,0,1,0))

5.2.3 Creating an x-y scatter plot

learning19 %>%
  ggplot(aes(orga, procrastination, color = cluster)) + # x = orga, y = procrastination
  geom_point()

5.3 K-means clustering

5.3.1 Calculate the distances

Euclidean distance matrix

learning19_eu <- dist(learning19[2:4])
summary(learning19_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.083   1.601   1.741   2.192   6.741

5.3.2 Determine the k

set.seed(123)
k_max <- 5 # determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(learning19[2:4], k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results

The twcss value decrease heavily from 2 - 5 clusters. The optimal number of clusters was 3.

5.3.3 Perform k-means clustering

learning19_km <- kmeans(learning19[2:10], centers = 3)

Plot the dataset with clusters

pairs(learning19[2:10], col = learning19_km$cluster)

pairs(learning19[,2:4], col = learning19_km$cluster)

pairs(learning19[,5:10], col = learning19_km$cluster)

The optimal number of clusters was 3. We got the best overview with three clusters.

5.3.4 Perform k-means on the original data

library(devtools)
library(flipMultivariates)
learning19_scaled3 <- scale(learning19[2:4])
learning19_km3 <-kmeans(learning19_scaled3, centers = 3)
cluster <- learning19_km3$cluster
learning19_scaled3 <- data.frame(learning19_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = learning19_scaled3)
lda.fit_cluster

Warning in install.packages : package ‘flipMultivariates’ is not available

but I used to run it so I kept the codes.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(learning19_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)

5.3.5 3D plot

model_predictors <- dplyr::select(learning19_train, -deep2)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package.

Create a 3D plot of the columns of the matrix product.

library(plotly)
plot_ly (x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = learning19_train$deep2)
library(plot3D)

scatter3D(x = learning19$unref, y = learning19$deep, z = learning19$orga, col = NULL, 
          main = "learning19 data", xlab = "deep",
          ylab ="unref", zlab = "orga")

library(plotly)
plot_ly (x = learning19$unref, y = learning19$deep, z = learning19$orga, type= 'scatter3d', mode='markers', color = learning19$deep)